This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any questions, please open an issue. For the meaning and usage of each parameter, see Server Arguments.

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --model-path, --model | None | Type: str |
| --tokenizer-path | None | Type: str |
| --tokenizer-mode | auto | auto, slow |
| --tokenizer-worker-num | 1 | Type: int |
| --skip-tokenizer-init | False | bool flag (set to enable) |
| --load-format | auto | auto, safetensors |
| --model-loader-extra-config |  | Type: str |
| --trust-remote-code | False | bool flag (set to enable) |
| --context-length | None | Type: int |
| --is-embedding | False | bool flag (set to enable) |
| --enable-multimodal | None | bool flag (set to enable) |
| --revision | None | Type: str |
| --model-impl | auto | auto, sglang, transformers |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --host | 127.0.0.1 | Type: str |
| --port | 30000 | Type: int |
| --skip-server-warmup | False | bool flag (set to enable) |
| --warmups | None | Type: str |
| --nccl-port | None | Type: int |
| --fastapi-root-path | None | Type: str |
| --grpc-mode | False | bool flag (set to enable) |
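
As a minimal sketch of how the arguments listed above combine, the snippet below assembles a launch command as an argument list. It assumes the server is started with `python -m sglang.launch_server`, which is not stated on this page, and it uses a placeholder model path. `build_launch_command` is a hypothetical helper written for this example, not part of SGLang.

```python
import shlex


def build_launch_command(model_path, host="127.0.0.1", port=30000, **flags):
    """Assemble an argument list from values documented in the tables above.

    Assumption: the entry point `python -m sglang.launch_server` is not
    named on this page; adjust it to match your installation.
    """
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
    ]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Bool flags are enabled by their presence alone.
            cmd.append(flag)
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_launch_command(
    "/path/to/model",        # placeholder path
    trust_remote_code=True,  # bool flag (set to enable)
    tokenizer_mode="auto",
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if it is later passed to `subprocess.run`; `shlex.join` is used only to print a copyable form.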

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --dtype | auto | auto, float16, bfloat16 |
| --quantization | None | modelslim |
| --quantization-param-path | None | Type: str |
| --kv-cache-dtype | auto | auto |
| --enable-fp32-lm-head | False | bool flag (set to enable) |
| --modelopt-quant | None | Type: str |
| --modelopt-checkpoint-restore-path | None | Type: str |
| --modelopt-checkpoint-save-path | None | Type: str |
| --modelopt-export-path | None | Type: str |
| --quantize-and-serve | False | bool flag (set to enable) |
| --rl-quant-profile | None | Type: str |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --mem-fraction-static | None | Type: float |
| --max-running-requests | None | Type: int |
| --prefill-max-requests | None | Type: int |
| --max-queued-requests | None | Type: int |
| --max-total-tokens | None | Type: int |
| --chunked-prefill-size | None | Type: int |
| --max-prefill-tokens | 16384 | Type: int |
| --schedule-policy | fcfs | lpm, fcfs |
| --enable-priority-scheduling | False | bool flag (set to enable) |
| --schedule-low-priority-values-first | False | bool flag (set to enable) |
| --priority-scheduling-preemption-threshold | 10 | Type: int |
| --schedule-conservativeness | 1.0 | Type: float |
| --page-size | 128 | Type: int |
| --swa-full-tokens-ratio | 0.8 | Type: float |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) |
| --enable-dynamic-chunking | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --device | None | Type: str |
| --tensor-parallel-size, --tp-size | 1 | Type: int |
| --pipeline-parallel-size, --pp-size | 1 | Type: int |
| --pp-max-micro-batch-size | None | Type: int |
| --pp-async-batch-depth | None | Type: int |
| --stream-interval | 1 | Type: int |
| --stream-output | False | bool flag (set to enable) |
| --random-seed | None | Type: int |
| --constrained-json-whitespace-pattern | None | Type: str |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) |
| --watchdog-timeout | 300 | Type: float |
| --soft-watchdog-timeout | 300 | Type: float |
| --dist-timeout | None | Type: int |
| --base-gpu-id | 0 | Type: int |
| --gpu-id-step | 1 | Type: int |
| --sleep-on-idle | False | bool flag (set to enable) |
| --custom-sigquit-handler | None | Optional[Callable] |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --log-level | info | Type: str |
| --log-level-http | None | Type: str |
| --log-requests | False | bool flag (set to enable) |
| --log-requests-level | 2 | 0, 1, 2, 3 |
| --log-requests-format | text | text, json |
| --crash-dump-folder | None | Type: str |
| --enable-metrics | False | bool flag (set to enable) |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] |
| --bucket-time-to-first-token | None | List[float] |
| --bucket-inter-token-latency | None | List[float] |
| --bucket-e2e-request-latency | None | List[float] |
| --collect-tokens-histogram | False | bool flag (set to enable) |
| --prompt-tokens-buckets | None | List[str] |
| --generation-tokens-buckets | None | List[str] |
| --gc-warning-threshold-secs | 0.0 | Type: float |
| --decode-log-interval | 40 | Type: int |
| --enable-request-time-stats-logging | False | bool flag (set to enable) |
| --kv-events-config | None | Type: str |
| --enable-trace | False | bool flag (set to enable) |
| --oltp-traces-endpoint | localhost:4317 | Type: str |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --export-metrics-to-file | False | bool flag (set to enable) |
| --export-metrics-to-file-dir | None | Type: str |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --data-parallel-size, --dp-size | 1 | Type: int |
| --load-balance-method | round_robin | round_robin, total_requests, total_tokens |
| --prefill-round-robin-balance | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --dist-init-addr, --nccl-init-addr | None | Type: str |
| --nnodes | 1 | Type: int |
| --node-rank | 0 | Type: int |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --json-model-override-args | {} | Type: str |
| --preferred-sampling-params | None | Type: str |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --enable-lora | False | bool flag (set to enable) |
| --max-lora-rank | None | Type: int |
| --lora-target-modules | None | all |
| --lora-paths | None | Type: List[str] / JSON objects |
| --max-loras-per-batch | 8 | Type: int |
| --max-loaded-loras | None | Type: int |
| --lora-eviction-policy | lru | lru, fifo |
| --lora-backend | triton | triton |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --attention-backend | None | ascend |
| --prefill-attention-backend | None | ascend |
| --decode-attention-backend | None | ascend |
| --sampling-backend | None | pytorch, ascend |
| --grammar-backend | None | xgrammar |
| --mm-attention-backend | None | ascend_attn |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, cutlass, triton, aiter |
| --disable-flashinfer-autotune | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --speculative-algorithm | None | EAGLE3, NEXTN |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str |
| --speculative-draft-model-revision | None | Type: str |
| --speculative-draft-load-format | None | auto |
| --speculative-num-steps | None | Type: int |
| --speculative-eagle-topk | None | Type: int |
| --speculative-num-draft-tokens | None | Type: int |
| --speculative-accept-threshold-single | 1.0 | Type: float |
| --speculative-accept-threshold-acc | 1.0 | Type: float |
| --speculative-token-map | None | Type: str |
| --speculative-attention-mode | prefill | prefill, decode |
| --speculative-moe-runner-backend | None | auto |
| --speculative-moe-a2a-backend | None | ascend_fuseep |
| --speculative-draft-attention-backend | None | ascend |
| --speculative-draft-model-quantization | None | unquant |

| Argument | Defaults | Options | A2 | A3 | Experimental |
|---|---|---|---|---|---|
| --speculative-ngram-min-match-window-size | 1 | Type: int |
| --speculative-ngram-max-match-window-size | 12 | Type: int |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int |
| --speculative-ngram-match-type | BFS | BFS, PROB |
| --speculative-ngram-branch-length | 18 | Type: int |
| --speculative-ngram-capacity | 10000000 | Type: int |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep |
| --moe-runner-backend | auto | auto, triton |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) |
| --deepep-mode | auto | normal, low_latency, auto |
| --deepep-config | None | Type: str |
| --ep-num-redundant-experts | 0 | Type: int |
| --ep-dispatch-algorithm | None | Type: str |
| --init-expert-location | trivial | Type: str |
| --enable-eplb | False | bool flag (set to enable) |
| --eplb-algorithm | auto | Type: str |
| --eplb-rebalance-layers-per-chunk | None | Type: int |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float |
| --expert-distribution-recorder-mode | None | Type: str |
| --expert-distribution-recorder-buffer-size | None | Type: int |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) |
| --moe-dense-tp-size | None | Type: int |
| --elastic-ep-backend | None | none, mooncake |
| --mooncake-ib-device | None | Type: str |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --max-mamba-cache-size | None | Type: int |
| --mamba-ssm-dtype | float32 | float32, bfloat16 |
| --mamba-full-memory-ratio | 0.2 | Type: float |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer |
| --mamba-track-interval | 256 | Type: int |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --enable-hierarchical-cache | False | bool flag (set to enable) |
| --hicache-ratio | 2.0 | Type: float |
| --hicache-size | 0 | Type: int |
| --hicache-write-policy | write_through | write_back, write_through, write_through_selective |
| --radix-eviction-policy | lru | lru, lfu |
| --hicache-io-backend | kernel | kernel_ascend, direct |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split |
| --hicache-storage-backend | None | file |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout |
| --hicache-storage-backend-extra-config | None | Type: str |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --enable-lmcache | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --cpu-offload-gb | 0 | Type: int |
| --offload-group-size | -1 | Type: int |
| --offload-num-in-group | 1 | Type: int |
| --offload-prefetch-step | 1 | Type: int |
| --offload-mode | cpu | Type: str |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --multi-item-scoring-delimiter | None | Type: int |

| Argument | Defaults | Options | A2 | A3 | Special | Planned |
|---|---|---|---|---|---|---|
| --disable-radix-cache | False | bool flag (set to enable) |
| --cuda-graph-max-bs | None | Type: int |
| --cuda-graph-bs | None | List[int] |
| --disable-cuda-graph | False | bool flag (set to enable) |
| --disable-cuda-graph-padding | False | bool flag (set to enable) |
| --enable-profile-cuda-graph | False | bool flag (set to enable) |
| --enable-cudagraph-gc | False | bool flag (set to enable) |
| --enable-nccl-nvls | False | bool flag (set to enable) |
| --enable-symm-mem | False | bool flag (set to enable) |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) |
| --disable-tokenizer-batch-encode | False | bool flag (set to enable) |
| --disable-outlines-disk-cache | False | bool flag (set to enable) |
| --disable-custom-all-reduce | False | bool flag (set to enable) |
| --enable-mscclpp | False | bool flag (set to enable) |
| --enable-torch-symm-mem | False | bool flag (set to enable) |
| --disable-overlap-schedule | False | bool flag (set to enable) |
| --enable-mixed-chunk | False | bool flag (set to enable) |
| --enable-dp-attention | False | bool flag (set to enable) |
| --enable-dp-lm-head | False | bool flag (set to enable) |
| --enable-two-batch-overlap | False | bool flag (set to enable) |
| --enable-single-batch-overlap | False | bool flag (set to enable) |
| --tbo-token-distribution-threshold | 0.48 | Type: float |
| --enable-torch-compile | False | bool flag (set to enable) |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) |
| --enable-piecewise-cuda-graph | False | bool flag (set to enable) |
| --piecewise-cuda-graph-tokens | None | Type: JSON list |
| --piecewise-cuda-graph-compiler | eager | ["eager", "inductor"] |
| --torch-compile-max-bs | 32 | Type: int |
| --piecewise-cuda-graph-max-tokens | 4096 | Type: int |
| --torchao-config |  | Type: str |
| --enable-nan-detection | False | bool flag (set to enable) |
| --enable-p2p-check | False | bool flag (set to enable) |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) |
| --triton-attention-num-kv-splits | 8 | Type: int |
| --triton-attention-split-tile-size | None | Type: int |
| --delete-ckpt-after-loading | False | bool flag (set to enable) |
| --enable-memory-saver | False | bool flag (set to enable) |
| --enable-weights-cpu-backup | False | bool flag (set to enable) |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) |
| --allow-auto-truncate | False | bool flag (set to enable) |
| --enable-custom-logit-processor | False | bool flag (set to enable) |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) |
| --disable-shared-experts-fusion | False | bool flag (set to enable) |
| --disable-chunked-prefix-cache | False | bool flag (set to enable) |
| --disable-fast-image-processor | False | bool flag (set to enable) |
| --keep-mm-feature-on-device | False | bool flag (set to enable) |
| --enable-return-hidden-states | False | bool flag (set to enable) |
| --enable-return-routed-experts | False | bool flag (set to enable) |
| --scheduler-recv-interval | 1 | Type: int |
| --numa-node | None | List[int] |
| --rl-on-policy-target | None | fsdp |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --debug-tensor-dump-output-folder | None | Type: str |
| --debug-tensor-dump-layers | None | List[int] |
| --debug-tensor-dump-input-file | None | Type: str |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --disaggregation-mode | null | null, prefill, decode |
| --disaggregation-transfer-backend | mooncake | ascend |
| --disaggregation-bootstrap-port | 8998 | Type: int |
| --disaggregation-decode-tp | None | Type: int |
| --disaggregation-decode-dp | None | Type: int |
| --disaggregation-ib-device | None | Type: str |
| --disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) |
| --disaggregation-decode-enable-fake-auto | False | bool flag (set to enable) |
| --num-reserved-decode-tokens | 512 | Type: int |
| --disaggregation-decode-polling-interval | 1 | Type: int |
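
To illustrate how the disaggregation arguments above might fit together, the sketch below derives the flag lists for one prefill instance and one decode instance. The flag names and option values are taken from the table; the helper `disaggregation_args` is hypothetical, and passing the same bootstrap port to both roles is an assumption that should be checked against your deployment.

```python
VALID_MODES = ("null", "prefill", "decode")


def disaggregation_args(mode, bootstrap_port=8998, transfer_backend="ascend"):
    """Return the disaggregation-related flags for one server instance.

    Hypothetical helper: it only formats flags from the table above and
    does not talk to any server.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown disaggregation mode: {mode!r}")
    args = ["--disaggregation-mode", mode]
    if mode != "null":
        # Assumption: both roles receive the same transfer backend and port.
        args += [
            "--disaggregation-transfer-backend", transfer_backend,
            "--disaggregation-bootstrap-port", str(bootstrap_port),
        ]
    return args


prefill_args = disaggregation_args("prefill")
decode_args = disaggregation_args("decode")
print(prefill_args)
print(decode_args)
```

Rejecting unknown modes early keeps a typo such as `prefil` from silently falling through to a non-disaggregated launch.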

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --encoder-only | False | bool flag (set to enable) |
| --language-only | False | bool flag (set to enable) |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake |
| --encoder-urls | [] | List[str] |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --custom-weight-loader | None | List[str] |
| --weight-loader-disable-mmap | False | bool flag (set to enable) |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
| --enable-pdmux | False | bool flag (set to enable) |
| --pdmux-config-path | None | Type: str |
| --sm-group-num | 8 | Type: int |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --mm-max-concurrent-calls | 32 | Type: int |
| --mm-per-request-timeout | 10.0 | Type: float |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) |
| --mm-process-config | None | Type: JSON / Dict |
| --mm-enable-dp-encoder | False | bool flag (set to enable) |
| --limit-mm-data-per-request | None | Type: JSON / Dict |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --decrypted-config-file | None | Type: str |
| --decrypted-draft-config-file | None | Type: str |
| --enable-prefix-mm-cache | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 | Planned |
|---|---|---|---|---|---|
| --enable-deterministic-inference | False | bool flag (set to enable) |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --forward-hooks | None | Type: JSON list |

| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
| --config | None | Type: str |

The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

| Argument | Defaults | Options |
|---|---|---|
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

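
Because these flags are listed as unsupported on the NPU, it can help to catch them before launching. The following is a minimal sketch of such a check; the flag set is copied from the list above, while `find_unsupported` and the sample `argv` are hypothetical and not part of SGLang.

```python
# Flags listed above as unsupported on the NPU.
UNSUPPORTED_ON_NPU = {
    "--checkpoint-engine-wait-weights-before-ready",
    "--kt-weight-path",
    "--kt-method",
    "--kt-cpuinfer",
    "--kt-threadpool-count",
    "--kt-num-gpu-experts",
    "--kt-max-deferred-experts-per-token",
}


def find_unsupported(argv):
    """Return the unsupported flags found in argv, in order of appearance."""
    found = []
    for token in argv:
        # Accept both "--flag value" and "--flag=value" spellings.
        flag = token.split("=", 1)[0]
        if flag in UNSUPPORTED_ON_NPU and flag not in found:
            found.append(flag)
    return found


argv = ["--model-path", "/path/to/model", "--kt-method", "AMXINT4", "--kt-cpuinfer=8"]
print(find_unsupported(argv))
```

A wrapper script could call `find_unsupported(sys.argv[1:])` and exit with an error message when the result is non-empty.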
The following parameters have some functional deficiencies in the community version.

| Argument | Defaults | Options |
|---|---|---|
| --enable-double-sparsity | False | bool flag (set to enable) |
| --ds-channel-config-path | None | Type: str |
| --ds-heavy-channel-num | 32 | Type: int |
| --ds-heavy-token-num | 256 | Type: int |
| --ds-heavy-channel-type | qk | Type: str |
| --ds-sparse-decode-threshold | 4096 | Type: int |
| --tool-server | None | Type: str |